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Abstract 



1 INTRODUCTION 



In many applications that require matrix 
solutions of minimal rank, the underlying 
cost function is non-convex leading to an 
intractable, NP-hard optimization problem. 
Consequently, the convex nuclear norm is 
frequently used as a surrogate penalty term 
for matrix rank. The problem is that in 
many practical scenarios there is no longer 
any guarantee that we can correctly estimate 
generative low-rank matrices of interest, the- 
oretical special cases notwithstanding. Con- 
sequently, this paper proposes an alterna- 
tive empirical Bayesian procedure build upon 
a variational approximation that, unlike the 
nuclear norm, retains the same globally min- 
imizing point estimate as the rank func- 
tion under many useful constraints. How- 
ever, locally minimizing solutions are largely 
smoothed away via marginalization, allow- 
ing the algorithm to succeed when standard 
convex relaxations completely fail. While 
the proposed methodology is generally ap- 
plicable to a wide range of low-rank ap- 
plications, we focus our attention on the 
robust principal component analysis prob- 
lem (RPCA), which involves estimating an 
unknown low-rank matrix with unknown 
sparse corruptions. Theoretical and empir- 
ical evidence are presented to show that 
our method is potentially superior to related 
MAP-based approaches, for which the con- 
vex prin ciple component pur suit (PCP) al- 
gorithm (jCandes et al.l . 120111 ) can be viewed 
as a special case. 



Recently there has been a surge of interest 
in finding low-rank decompositions of matrix- 
valued data subject to some problem-specific con- 
straints (iBabacan et al.1. 2011 : Candes &: Rechtl . 2008 : 



Candes et aLl. 1201 It IChandrasekaran et al.l . 12011 , 
Ding et aLr2011 ). While the methodology proposed 
herein is applicable to a wide range of low-rank appli- 
cations, we will focus our attention on the robust prin- 
ci pal componeii t analy sis problem (RPCA) described 
in lCandes et al.l ( 20111 ). We begin with the observation 
model 

Y = X + S + E, (1) 

where Y G jg an observed data matrix, X is an 

unknown low-rank component, is a sparse corrup- 
tion matrix, and E is diffuse noise, with iid elements 
distributed as A^(0, A). Without loss of generality, we 
will assume throughout that n > m. To estimate X 
and S given Y, one possibility is to solve 



mmhlY - X - S\ 
x,s A 



iRsink[X] + \\S\\a, (2) 



where ||S'||o denotes the matrix £o norm of S, or a 
count of the nonzero elements in S. The reason for 
the factor of n is to ensure that the rank and spar- 
sity terms are properly balanced, meaning that both 
terms range from to nm, which reflects our balanced 
uncertainty regarding their relative contributions to 
Y. Unfortunately, solving ([2]) is problematic because 
the objective function is both discontinuous and non- 
convex. In general, the only way to guarantee that the 
global minimum is found is to conduct an exhaustive 
combinatorial search, which is intractable in all but 
the simplest cases. 

The most common alternative, s ometimes called prin - 
ciple component pursuit (PCP) (jCandes et al.l . l201ll) . 
is to replace ([2]) with a convex surrogate such as 



mmhY~X-Sr^ + V^\\X\ 

A ,o A 



l^lli, (3) 



where denotes the nuclear norm of X (or the 

sum of its singular values). Note that the scale factor 
from ^ has been changed from n to ^/n; this is an 
artifact of the relaxation mechanism balancing the nu- 
clear and ii normslll A variety of recent theoretical re- 
sults stipulate when solutions of (jS]), particularly in the 
limiting case where A — (reflecting the assumption 
that E = 0), will produ ce reliable estim ates of X an d 



S { Candes et al. , 201 ll Icha ndrasekara n et al. , 2011 ) 



However, in practice these results have marginal value 
since they are based upon strong, typically unverifi- 
able assumptions on the support of S and the struc- 
ture oi X. In general, the allowable support of 5 may 
be prohibitively small and unstructured (possibly ran- 
dom); related assumptions are required for the rank 
and structure of X. Thus there potentially remains 
a sizable gap between what can be achieved by min- 
imizing the 'ideal' cost function ([2]) and the convex 
relaxation (jSj. 

In Section [21 as a motivational tool we discuss a 
simple non-convex scheme based on a variational 
majorization-minimization approach for locally min- 
imizing (l2|). Then in Section |3] we reinterpret this 
method as maximum a posteriori (MAP) estimation 
and use this perspective to design an alternative em- 
pirical Bayesian algorithm that avoids the major short- 
comings of MAP estimation. Section 2] investigates 
analytical properties of this empirical Bayesian alter- 
native with respect to globally and locally minimizing 
solutions. Later in Section [5] we compare a state-of- 
the-art PCP algorithm with the proposed approach on 
simulated data as well as a photometric stereo prob- 
lem. 

2 NON-CONVEX MAJORIZATION- 
MINIMIZATION 

One possible alternative to ^ is to replace © with 
a non-convex yet smooth approximation that can at 
least be locally minimized. In the sparse estimation 
literature, one common substitution for the Iq norm is 
the Gaussian entropy measure X]zlog|si|, which may 
also sometimes include a small regularizer to avoid tak- 
ing the log of zeroU This can be justified (in part) by 
the fact that 



: lim (1^ 
p-s-O p ^ 



d^-l)«ll^llo. (4) 



^Actually, different scaling factors can be adopted to re- 
flect different assumptions about the relative contributions 
of low-rank and sparse terms. But we assume throughout 
that no such knowledge is available. 

^Algorithms and analysis foUow through much the same 
regardless. 



An analogous approximation suggests replacing the 
rank penalty with log |AA-^| as has been suggested for 
related rank minimization problems (Mohan & FazeJ, 
since log|AA^| = log cr^, where oi are the 
singular values of XX^ . This leads to the alternative 
cost function 



min-||y-A-5||j.+nlog|AA^ 



-2^1og|s, 

i,3 



(5) 



Optimization of ([S]) can be accomplished by a straight- 
forward majorization-minimization approach based 
upon v ariational bounds o n the non-convex penalty 
19991) . e 



terms (1 Jordan et al 



log I s I is a concave fu nction of , it can be expres sed 
using duality theory ( Bovd fc Vandenberghel 2004 ) as 
the minimum of a particular set of upper-bounding 
lines: 



logs^ 



min h log 

7>0 7 



7-1. 



(6) 



Here 7 is a non-negative variational parameter con- 
trolling the slope. Therefore, for any fixed 7 we have 
a strict, convex upper-bound on the concave log func- 
tion. Likewise, for th e rank term we ca.n use the anal- 
ogous representation ( Mohan fc Fazel . 2010l ) 



nloglXA^I = min Trace [AA"^*"^] + nlog I^-I C, 

^ ^ (7) 

where C is an irrelevant constant and VE* is a positive 
semi-definite matrix of variational parameters^ Com- 
bining these bounds we obtain an equivalent optimiza- 
tion problem 



min \\\Y-X-S\ 



— +log7y 



+ Trace [AA'^^'-i] +n log 1*1, (8) 

where P is a matrix of non-negative elements composed 
of the variational parameters corresponding to each 
Sij. With P and ^ fixed, (|8|) is quadratic in X and S 
and can be minimized in closed form via 



(9) 



where yj, Xj^ and Sj represent the j-th columns of 
A, and S respectively and P^ is a diagonal matrix 
formed from the j-ih. column of P. Likewise, with A 
and S fixed, P and ^' can also be obtained in closed 
form using the updates 



1 



-AA' 



(10) 



If X is full rank, then 4' must be positive definite. 



While local minimization of (O is clear cut, finding 
global solutions can still be highly problematic just as 
before. Whenever any coefficient of S goes to zero, or 
whenever the rank of X is reduced, we are necessarily 
at a local minimum with respect to this quantity such 
that we can never increase the rank or a zero-valued 
coefficient magnitude in search of the global optimum. 
(This point will be examined in further detail in Sec- 
tion 21) Thus the algorithm may quickly converge to 
one of a combinatorial number of local solutions. 

3 VARIATIONAL EMPIRICAL 
BAYESIAN ALGORITHM 

From a Bayesian perspective we can formulate ([5]) as 
a MAP estimation problem based on the distributions 



p{Y\X,S) cx exp 
p{X) cx 



2A 



\Y-X~Sf^ 



|^^T|»/2 



(11) 



It is then transparent that solving 



max]5(X, S\Y) = maxp(y|X, S)p{X)p{S) (12) 

is equivalent to solving ([5]) after an inconsequential 
— 21og(-) transformation. But as implied above, this 
strategy is problematic because the effective posterior 
is characterized by numerous spurious peaks render- 
ing MAP estimation intractable. A more desirable 
approach would ignore most of these peaks and focus 
only on regions with significant posterior mass, regions 
that hopefully also include the posterior mode. One 
way to accomplish this involves using the bounds from 
([6|) and ^ to construct a simple approximate poste- 
rior that reflects the mass of the original p{X,S\Y) 
sans spurious peaks. We approach this task as follows. 



From 



and (O) we can infer that 

p(S) cx maxi6(5;r) 

r>o 

p{X) cx maxp(X;*) 

'I'>-0 



(13) 
(14) 



where 



exp 



4i: 



— +log7»j 



(15) 



p{X; *) = exp 



-^Trace [XX^^-'] ^ ^log\^\ 



which can be viewed as unnormalized approximate pri- 
ors offering strict lower bounds on p{S) and p{X). We 



also then obtain a tractable posterior approximation 
given by 



p{X,S\Y;T,^) 



p{Y\S,X)p{S;T)p{X;^) 
J p(Y\S,X)p{S;T)p{X;^)dSdX' 

(16) 

Here p{X, S\Y; F, ^) is a Gaussian distribution with 
closed- form first and second moments, e.g., the means 
of S and X are actually given by the righthand sides 
of (|n]). The question remains how to choose F and 
^f. With the goal of reflecting the mass of the true 
distribution ip(y, X, S), we adopt the approach from 
Wipf et al.l pOlll ) and attempt to solve 



min / \p(Y,X,S) - p(Y\S,X)p(S;r)p(X;^)\dXdS 
*.r J 

(17) 



= min / p(Y\S,X) \p(X)p(S) - p(S;r)p(X;^)\dXdS. 
*,r J 

(18) 

The basic idea here is that we only care that the 
approximate priors match the true ones in regions 
where the likelihood function p{Y\X,S) is significant; 
in other regions the mismatch is more or less irrele- 
vant. Moreover, by virtue of the strict lower varia- 
tional bound, ([T8|) reduces to 



max / p(Y\S,X)p(S;r)p(X;^)dXdS = mmC(-^,r) 
-ii.r J "P.r 

(19) 

where 



/:(vI/,F)^^[yJl]-/y,+log|l],J 



with 



XI. 



(20) 



(21) 



This Ey^. can be viewed as the covariance of the j-th 
column of Y given fixed values of 4* and F. To re- 
cap then, we need now minimize £(^,F) with respect 
to ^ and F, and then plug these estimates into ([T6|) 
giving the approximate posterior. The mean of this 
distribution (see below) can then be used as a point 
estimate for X and S. This process is sometimes re- 
ferred to as empirical Bayes because we are using the 
data to guide our search for an op timal prior distribu- 
tion (jBergeil Il985t iTippingl . l200l[ ). 



3.1 UPDATE RULE DERIVATIONS 



It turns out that minimization of ()20|) can be accom- 
plished concurrently with computation of the poste- 
rior mean leading to simple, efficient update rules. 
While (|20p is non-convex, we can use a majorization- 
minimization approach analogous to that used for 
MAP estimation. For this purpose, we utilize simplify- 
ing upper bounds on both terms of the cost function as 



has been done for related sparse estimation problems £(5',r) has been conveniently decoupled and we need 
Wipf fc Nagaraian ( 201Clt ). only compute 



First, the data-dependent term is concave with re- 
spect to VP^^ and and hence can be expressed as a 
minimization over r~^)-dependent hyperplanes. 

With some linear algebra, it can be shown that 



(22) 

for all J. With a slight abuse of notation, we adopt 
X = [xi, . . . ,Xn] and S = [si,...,s„] as the varia- 
tional parameters in (j22p because they end up play- 
ing the same role as the unknown low-rank and sparse 
coefficients and provide a direct link to the MAP es- 
timates. Additionally, the Xj and Sj which minimize 
turn out to be equivalent to the posterior means 
of (|16l) given and F and will serve as our point esti- 
mates. 

Secondly, for the log-det term, we first use the deter- 
minant identity 



miny^ {x'^-^'^Xj + Trace \Uj^~'^]) +nlog|^'| 



and 



+ 



(27) 



+ log7,,, Vi,j. (28) 



We summarize the overall procedure next. 

3.2 ALGORITHM SUMMARY 

1. Compute 4 

2. Initialize kI, and vf^ kI for all j. 

3. For the (k + l)-th iteration, compute the optimal 
Xj and Sj via 



log I'i + F, + A/ = log 1*1 + log F,- 



where 



log I A, 



A, ^ A- 



1 


■ / 


/ ■ 




■ ^-1 " 




/ 


/ 


+ 


. f7l_ 



(23) 



(24) 



and C is an irrelevant constant. The term log \Aj\ is 
jointly concave in both ^P"^ and f ~^ and thus can be 
bounded in a similar fashion as ([2^ . although a closed- 
form solution is no longer available. (Other decompo- 
sitions lead to different bounds and different candidate 
update rules.) Here we use 

log I A, I = (25) 

^mn Jrace [Uf^-' + V^j'] ~ h*{U,,V,) 

where h*{Uj,Vj) is the concave conjugate function of 
log \Aj \ with respect to ^I'^^ and f^^. Note that while 
h*{Uj, Vj) has no closed- form solution, the minimizing 
values of Uj and Vj can be computed in closed-form 
via 



dlog\A, 



d\og\A, 



(26) 



When we drop the minimizations over the variational 
parameters Xj, Sj, Uj, and Vj for all j, we arrive at 
a convenient family of upper bounds on the cost func- 
tion £(*, F). Given some estimate of and F, we can 
evaluate all variational parameters in closed form (see 
below). Likewise, given all of the variational param- 
eters we can solve directly for \[' and F because now 



5('=+i)^fW(vi/W+ff +A/) y, (29) 
4. Likewise, compute the optimal Uj and Vj via 



^.C^+i) ^ f f - f f ) (vl/W+ff +A/ 



j=,(fc) 

(30) 



5. Update 5* and F using the new variational param- 
eters via 



(fc+i) 



E 



(fc+i) 



v^ 



(fc+i) 



u. 



,Vz,i (31) 



6. Repeat steps 3 through 5 until convergence. (Re- 
call that Tj is a diagonal matrix formed from the 
j-th column of F.) This process is guaranteed to 
reduce or leave unchanged the cost function at 
each iteration. 



Note that if we set ^/j^+i)^ v^(fc+i) 



for all 

J, then the algorithm above is guaranteed to (at 
least locally) minimize the MAP cost function from 
( 5l) ■ Additionall y , for matrix completion problems 



Candes fc Rechtl l2008l) . where the support of the 



sparse errors is known a priori, we need only set each 



7ij corresponding to a corrupted entry to oo. This lim- 
iting case can easily be handled with efficient reduced 
rank updates. 

One positive aspect of this algorithm is that it is 
largely parameter free. We must of course choose some 
stopping criteria, such as a maximum number of itera- 
tions or a convergence tolerance. (For all experiments 
in Section [5] we simply set the maximum number of 
iterations at 100.) We must also choose some value for 
A, which balances allowable contributions from a dif- 
fuse error matrix although frequently methods have 
some version of this parameter, including the POP al- 
gorithm. For all of our experiments we simply choose 
A = 10"^ since we did not include an E component 
consistent with the original RPCA formulation from 
Candes et all (|201ll ). 



From a complexity standpoint, each iteration of the 
above algorithm can be computed in O(m'^n), where 
n > m, so it is linear in the larger dimension of Y and 
cubic in the smaller dimension. For many computer 
vision applications (see Section [5] for one example) , 
images are vectorized and then stacked, so Y may be 
m =number-of-images by n = number-of-pixels. This 
is relatively efficient, since the num ber of images m ay 
be on the order of 100 or fewer (see lWu et aP (|2010l )). 
However, when Y is a large square matrix, the up- 
dates are more expensive to compute. In the future we 
plan to investigate various approximation techniques 
to handle this scenario. 

As a final implementation-related point, when given 
access to a priori knowledge regarding the rank of X 
and/or sparsity of S, it is possible to bias the algo- 
rithm's initialization (from Step 1 above) and improve 
the estimation accuracy. However, we emphasize that 
for all of the experiments reported in Section [5] we as- 
sumed no such knowledge. 

3.3 Alternative Bayesian Methods 

Two other Bayesian-inspired methods have recently 
been prop osed for solving t he RPCA problem. The 
first from iDing et al.l (|201lh is a hierarchical model 
with conjugate prior densities on model parameters at 
each level such that inference can be performed using 
a Gibbs sampler. This method is useful in that the 
A parameter balancing the contribution from diffuse 
errors E is estimated directly from the data. More- 
over, the authors report significant improvement over 
PCP on example problems. A potential downside of 
this model is that theoretical analysis is difficult be- 
cause of the underlying complexity. Additionally, a 
large number of MCMC steps are required to obtain 
good estimates leading to a significant computational 
cost even when Y is small. It also uses an estimate of 



Rank[X] which can effect the convergence rate of the 
Gibbs sampler. 

A second method from lBabacan et"aL ( 201ll ) similarly 
employs a hierarchial Bayesian model but uses a fac- 
torize d mean- field variational approximation for infer- 
20001) . Note that this is an entirely dif- 



ence (Attias 



ferent type of variational method than ours, relying on 
a posterior distribution that factorizes over X and S, 
meaning p(X,5|r) « q{X\Y)q{S\Y), where qiX\Y) 
and q{S\Y) are approximating distributions learned by 
minimizing a free energy-based cost function0 Unlike 
our model, this factorization implicitly decouples X 
and 5 in a manner akin to MAP estimation, and may 
potentially produce more locally minimizing solutions 
(see analysis below). Moreover, while this approach 
also has a mechanism for estimating A, there is no 
comprehensive evidence given that it can robustly ex- 
pand upon the range of corruptions and rank that can 
already be handled by PCP. 

To summarize both of these methods then, we would 
argue that while they offer a compelling avenue for 
computing A automatically, the underlying cost func- 
tions are substantially more complex than PCP or our 
method rendering more formal analyses somewhat dif- 
ficult. As we shall see in Sections|4]and[5l the empirical 
Bayesian cost function we propose is analytically prin- 
cipled and advantageous, and empirically outperforms 
PCP by a wide margin. 

4 ANALYSIS 

In this section we will examine global and local minima 
properties of the proposed method and highlight po- 
tential advantages over MAP, of which PCP can also 
be interpreted as a special case. For analysis purposes 
and comparisons with MAP estimation, it is helpful 
to convert the empirical Bayes cost function ([20l) into 
{X, S')-space by first optimizing over Uj, Vj, 4* and F, 
leaving only the unknown coefficient matrices X and 
S. Using this process, it is easily shown that the es- 
timates of X and S obtained by globally (or locally) 
minimizing ()20|) will also globally (or locally) minimize 



minllr- a:-5| 



+ XgEB{X,S;X), (32) 



where the penalty function is given by 
gEBiX,S;X)^ (33) 



n 

min Va;,^'"^a;,+s;ff7^s,+log|^' + f,- + A/| 

^ ^ i=l 



''Additional factorizations are also included in the 
model. 



Note that the implicit MAP penalty from ^ is nearly 
identical: 

gmapiX, S) ^ (34) 



mm 

r>o 



in > x.j'^ 



s,+log|^'|+log|fj 



The primary distinction is that in the MAP case the 
variational parameters separate whereas in empirical 
Bayesian case they do not. (Note that, as discussed 
below, we can apply a small regularizer analogous to A 
to the log terms in the MAP case as well.) This implies 
that gmap{X, S) can be expressed as some gmap{X) + 
gmap{S) whereas gEB{X, S; A) cannot. A related form 
of non-separability has been shown to be advantageous 
in the conte xt of sparse estima tion from overcomplete 



dictionaries (jWipf et al.l . 120111 ) 



We now examine how this crucial distinction can be 
beneficial in producing maximally sparse, low-rank so- 
lutions that optimize ([2]). We first demonstrate how 
(|32l) mimics the global minima profile of ([2]) . Later we 
show how the smoothing mechanism of the empirical 
Bayesian marginalization can mitigate spurious locally 
minimizing solutions. 



The o riginal RPCA development from iCandes et al 
assumes that E = 0, which is somewhat easier 
to analyze. We consider this scenario first. 

Theorem 1. Assume that there exists at least one so- 
lution to y = X+S such that Rank max j |jsj||o < 
m. Then in the limit as A ^ 0, any solution that glob- 
ally minimizes (j32p will globally minimize 



Proofs will be deferred to a subsequent journal pub- 
lication. Note that the requirement Rank[A'] + 
maxj ||sj||o < m is a relatively benign assumption, be- 
cause without it the matrices X and S are formally 
unidentifiable even if we are able to globally solve © . 
For E > 0, we may still draw direct comparisons be- 
tween p2)) and ([2|) when we deviate slightly from the 
Bayesian development and treat gEsiX, S; X) as an 
abstract, stand-alone penalty function. In this context 
we may consider gEsiX, S; a), with a ^ A as a more 
general candidate for estimating RPCA solutions. 

Corollary 1. Assume that A^(a) and S(^x) are a 
unique, optimal solution to ([2|) and that Rank [A"(a)] + 
maxj II llo < Then there will always exist 

some A' and a' such that the global minimum of 

min \\Y -X- S\\jr + X'gEsiX, S] a'), (35) 

X ,s 



denoted X 



X 



(A' 



(A' 

X, 



w\ 



and S(\i 
I < e and 1 1 S*, 



satisfies the conditions 



e can be arbitrarily small. 



(A',a') 



S(x) II < e, where 



Of course MAP estimation can satisfy a similar prop- 
erty as Theorem [T] and Corollary [T] after a minor mod- 
ification. Specifically, we may define 



.iX,S;a)^ 



(36) 



-log|* 



-iog|r, 



min y^a;,^' ^Xj 

^ ' ^ j=i 

and then achieve a comparable result to the above us- 
ing gmapiX, S; a'). The advantage of empirical Bayes 
then is not with respect to global minima, but rather 
with respect to local minima. The separable, addi- 
tive low-rank plus sparsity penalties that emerge from 
MAP estimation will always sufl^er from the following 
limitation: 

Theorem 2. Let S'|°'' denote any matrix S with Sij = 
a. Now consider any optimization problem of the form 



mingi(X) 



32(5), s.t.Y = X + S, 



(37) 



where gi is an arbitrary function of the singular values 
of X and 172 is an arbitrary function of the magnitudes 
of the elements in S. Then to ensure that a global 
minimum of (j37p is a global minimum of ([2]) for all 
possible Y, we require that 



lim 

£^0 



52 



52 



(0) 



00 



(38) 



for all i and j and S. An analogous condition holds 
for the function gi. 



This result implies that whenever an element of S 
approaches zero, it will require increasing the asso- 
ciated penalty g2iS) against an arbitrarily large gra- 
dient to escape in cases where this coefficient was in- 
correctly pruned. Likewise, if the rank of X is prema- 
turely reduced in the wrong subspace, there may be no 
chance to ever recover since this could require increas- 
ing gi{X) against an arbitrarily large gradient factor. 
In general. Theorem [2] stipulates that if we would like 
to retain the same global minimum as ([2|) using a MAP 
estimation-based cost function, then we will necessar- 
ily enter an inescapable basin of attraction whenever 
either Rank [A"] < m or ||sj||o < m for some j. This 
is indeed a heavy price to pay. 

Crucially, because of the coupling of low-rank 
and sparsity regularizers, the penalty function 
gEB{X, S] A) does not have this limitation. In fact, we 
only encounter insurmountable gradient barriers when 
Rank[Ar] 4- \\sj\\o < rn for some j, in which case the 



covariance T,y. from (PT|) becomes degenerate (with A 
small), a much weaker condition. To summarize (em- 
phasize) this point then, MAP can be viewed as heav- 
ily dependent on degeneracy of the matrices Vl/ and F 
in isolation, whereas empirical Bayes is only sensitive 
to degeneracy of their summation. 

This distinction can also be observed in how the 
effective penalties on X and S, as imbedded in 
gEB{X, S] X), vary given fixed values of F or 4' re- 
spectively. For example, when "if is close to being full 
rank and orthogonal (such as when the algorithm is ini- 
tialized), then the implicit penalty on 5* is minimally 
non-convex (only slightly concave). In fact, as 5* be- 
comes large and orthogonal, the penalty converges to 
a scaled version of the £i norm. In contrast, as ^ be- 
comes smaller and low-rank, the penalty approaches 
a scaled version of the io norm, implying that max- 
imally sparse corruptions will be favored. Thus, we 
do not aggressively favor maximally sparse S until the 
rank has already been reduced and we are in the basin 
of attraction of a good solution. Of course no heuristic 
annealing strategy is necessary, the transition is han- 
dled automatically by the algorithm. 

Additionally, whenever is fixed, the resulting cost 
function formally decouples into n separate, canoni- 
cal sparse estimation problems on each Sj in isolation. 
With A = 0, it not difficult to show that each of these 
subproblems is equivalent to solving 



J Il2 



9eb{sj 



(39) 



where 



gEsiSj) = min V sjf-^s^ + log |$f,$^ + /| (40) 

is a concave sparsity penalty on Sj and $ is any ma- 
trix such that <^^<^^ = /j^ When $ is nearly orthog- 
onal, this problem has no local minima and a global 
solution that approximates the hard thresholding of 
the ^0 norm; however, direct min imization of the 
norm will have 2" local minima (I Wipf et al.l . l201l[) . 
In contrast, when $ is poorly conditioned (with ap- 
proximately low-r ank structure, it has been argued in 
Wipf et al.l (|2011l ) that penalties such as gEB{sj) are 
particularly appropriate for avoiding local minima. 

Something similar occurs when F is now fixed and 
we evaluate the penalty on X. This penalty ap- 
proaches something like a scaled version of the nu- 
clear norm (less concave) when elements of F are set 
to a large constant and it behaves more like the rank 
function when F is small. At initialization, when F 
is all ones, we are relatively free to move between 



solutions of various rank without incurring a heavy 
penalty. Later as F becomes sparse, solutions satisfy- 
ing Rank[X] + \\sj\\o < m for some j become heavily 
favored. 

As a final point, the proposed empirical Bayesian ap- 
proach can be implemented with alternative varia- 
tional bounds and possibly optimized with something 
akin to simultaneous reweighted nuclear and £i norm 
minimization, a perspective that naturally suggests 
further performance analyses such as those ap plied to 
sparse estimation in Wipf fc NagaraianI ( 2010[ ). 



5 EMPIRICAL RESULTS 

This section provides some empirical evidence for the 
efficacy of our RPCA method. First, we present com- 
parisons with POP recovering random subspaces from 
corrupted measurements. Later we discuss a photo- 
metric stereo application. In all cases we used the the 



augme nted lagrangian method (ALM) from iLin et al 



to implement PCP. This algorithm has efficient, 
guaranteed convergence and in previous empirical tests 
ALM has outperformed a variety of other methods in 
computing minimum nuclear norm plus £i norm solu- 
tions. 

5.1 RANDOM SUBSPACE SIMULATIONS 

Here we demonstrate that the empirical Bayesian al- 
gorithm from Section [3. 21 which we will refer to as EB, 
can recovery unknown subspaces from corrupted mea- 
surements in a much broader range of operating condi- 
tions compared to the convex PCP. In particular, for a 
given value of Rank[X], our method can handle a sub- 
stantially larger fraction of corruptions as measured by 
p — \\S\\Q/{nm). Likewise, for a given value of p, we 
can accurately e stimate an X with m uch higher rank. 
Consistent with ICandes et all (|201ll ). we consider the 
case where E = 0, such that all the error is modeled by 
S. This allows us to use the stable, convergent ALM 
code available online!^ 

The first experiment proceeds as follows. We gener- 
ate a low-rank matrix X with dimensions reflective of 
many computer vision problems: number- of-images x 
number- of -pixels. Here we choose m = 20 and n — 10*, 
the later dimension equivalent to a 100 x 100 pixel 
image. For each trial, we compute an to x n matrix 
with iid N{{), 1) entries. We then compute the SVD 
of this matrix and set all but the r largest singular 
values to zero to produce a low-rank X. S* is gen- 
erated with nonzero entries selected uniformly with 
probability p — 0.2. Nonzero values are sampled from 
an iid Uniform [-10, 10] distribution. We then compute 



''We have assumed here that is full rank. 



^http: //perception, csl -uiuc . edu/niatrix-raiik/| 



Y = X + S and try to estimate X and S using the EB 
and PCP algorithms. Estimation results averaged over 
multiple trials as r is varied from 1 to 10 are depicted 
in Figure [T] We plot normalized mean-squared error 

(MSE) as computed via <^||X - as well 

as the average angular error between the estimated and 
true subspaces. In both cases the average is across 10 
trials. 
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Figure 1: Estimation results where the corruption 
probability p is 0.2 and Rank[X]/TO is varied from 0.05 
to 0.50, meaning up to 50% of full rank. Top: Nor- 
malized mean-squared error (MSE). Bottom: Average 
angle (in degrees) between the estimated and true sub- 
spaces. 

From Figure [T] we observe that EB can accurately es- 
timate X for substantially higher values of the rank. 
Interestingly, we are also still able to estimate the cor- 
rect subspace spanned by columns of X perfectly even 
when the MSE of estimating X starts to rise (compare 
Figure [TJ Top) with Figure [IJiJottom)). Basically, this 
occurs because, even if we have estimated the sub- 
space perfectly, reducing the MSE to zero implicitly 
requires solving a challenging sparse estimation prob- 
lem for every observation column yj . For each column, 
this problem requires learning dj = Rank[X] + ||sj||o 
nonzero entries given only m = 20 observations. For 
our experiment, we can have dj > 14 with high prob- 
ability for some columns when the rank is high, and 
thus we may expect some errors in S (not shown). 
However, the encouraging evidence here is that EB 
is able to keep these corrupting errors at a minimum 
and estimate the subspace accurately long after PCP 
has failed. Moreover, if an accurate estimate of X is 



needed, as opposed to just the correct spanning sub- 
space, then a postprocessing error correction step can 
potentially be applied to each column individually to 
improve performance. 

The second experiment is similar to the first only now 
we hold Rank[X] fixed at 4, meaning Rank[X]/m = 
0.2, and vary the fraction of corrupted entries in S 
from 0.1 to 0.8. Figure [5] shows that EB is again able 
to drastically expand the range whereby successful es- 
timates are obtained. Notably it is able to recover the 
correct subspace even with 70% corrupted entries. 
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Figure 2: Estimation results with Rank[X]/m = 0.2 
and p varied from 0.1 to 0.8. Top: Normalized mean- 
squared error (MSE). Bottom: Average angle (in de- 
grees) between the estimated and true subspaces. 

As a final comparison, we tested PCP and EB on a 
400 X 400 observation matrix Y generated as above 
with a Rank[Ar]/m = 0.1 and p = 0.5. The estima- 
tion results are reported in Table [TJ PCP performs 
poorly since the normalized MSE is above one, mean- 
ing we would have been better off simply choosing 
X = in this regard. Additionally, the angular er- 
ror is very near 90 degrees, consistent with the error 
from a randomly chosen subspace in high dimensions. 
In contrast, EB provides a reasonably good estimate 
considering the difficulty of the problem. 

5.2 PHOTOMETRIC STEREO 

Photometric stereo is a method for estimating surface 
normals of an object or scene by capturing multiple 
images from a fixed viewpoint u nder different light- 
ing conditions ( Woodham . 1980l ). At a basic level, 



Table 1: Estimation results with m = n = 400, 
Rank[X]/m 0.1 and p = 0.5. 

PCP EB 

MSE (norm.) 1.235 0.066 
Angular Error 88.50 5.01 



across 5 trials are presented in Figure [31 The error 
metrics have been redefined to accommodate the pho- 
tometric stereo problem. We now define the normal- 
ized MSE as (^\\X - X\\^jr/\\X -YW^jrY which mea- 
sures how much improvement we obtain beyond just 
using the observation matrix Y directly. Similarly we 
normalized the angular error by dividing by the angle 
between Y and the true X for each trial. 



this methodology assumes a Lambertian object sur- 
face, point light sources at infinity, an orthographic 
camera view, and a linear sensor response function. 
Under these conditions, it has been shown that the 
intensities of a vectorized stack of images Y can be 
expressed as 



Y = NT, 



(41) 



where L is a 3 x to matrix of to normalized lighting 
directions, is a 3 x n matrix of surface normals at 
n pixel locations, and T is a, diagonal matrix of dif- 
fuse albedo values (IWoodhaml -HQBO). Thus, if we were 
to capture at least 3 images with known, linearly in- 
dependent lighting directions we can solve for N us- 
ing least squares. Of course in reality many common 
non-Lambertian effects can disrupt this process, such 
as specularities, cast or attached shadows, and image 
noise, etc. In many cases, these effects can be modeled 
as an additive sparse error term S applied to (|4T|) . 

tttld . 



As proposed in iWu et al 



we can estimate 
A^ by solving ([2]) assuming 
The resulting X, combined 



the subspace containing 
X = L'^NT and E ^ 0. 
with possibly other a priori information regarding the 
lighting d i rectio ns L can lead to an estimate of A^. 
Wu et al.l lioiS) propose using a modified version of 
PCP for this task, where a shadow mask is included to 
simplify the sparse error correction problem. However, 
in practical situations it may not always be possible to 
accurately locate all shadow regions in this manner so 
it is desirable to treat them as unknown sparse cor- 
ruptions. 

For this experiment we consider the synthetic Caesar 
image from the INRIA 3D Meshes Research Databasclll 
with known surface normals. Multiple 2D images 
with different known lighting conditions can easily be 
generated using th e Cook - Torrance reflectance model 



(jCook fc TorranceL 119811 ). These images are then 
stacked to produce Y. Because shadows are extremely 
difficult to handle in general, as a preprocessing step 
we remove rows of Y corresponding to pixel locations 
with more than 10% shadow coverage. Specular cor- 
ruptions were left unfiltered. We tested our algorithm 
as the number of images, drawn randomly from a batch 
of 40 total, was varied from 10 to 40. Results averaged 
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Figure 3: Estimation results using photometric stereo 
data as the number of images to is varied from 10 to 
40. Top: Normalized MSE (see altered definition in 
the main text). Bottom: Normalized angle between 
the estimated and the true subspaces. 

From Figure [3] it is clear that EB outperforms PCP in 
both MSE and angular error, especially when there are 
fewer images present. It is not entirely clear however 
why the MSE and angular error are relatively flat for 
EB as opposed to dropping lower as to increases. Of 
course these are errors relative to using Y directly to 
predict X, which could play a role in this counterintu- 
itive effect. 



6 CONCLUSIONS 

In this paper we have analyzed a new empirical 
Bayesian approach for matrix rank minimization in 
the context of RPCA, where the goal is to decompose 
a given data matrix into low-rank and sparse compo- 
nents. Using a variational approximation and subse- 
quent marginalization, we ultimately arrive at a novel 
rcgularization term that couples low-rank and spar- 



' |http : //www-roc . inria . fr /gamma/gamma/ download/ downlBk^. giqr^ alties in such a way that locally minimizing 



solutions arc effectively smoothed while the global op- 
timum matches that of the ideal RPCA cost function. 
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